Abstract. We consider in this paper an urn and ball problem with replacement,
where balls have different colors and are drawn uniformly from a single urn.
The numbers of balls with a given color are i.i.d. random variables with a
heavy-tailed probability distribution, for instance a Pareto or a Weibull
distribution. We draw a small fraction $p \ll 1$ of the total number of balls.
The basic problem addressed in this paper is to know to what extent we can
infer the total number of colors and the distribution of the number of balls
with a given color. By means of Le Cam's inequality and the Chen-Stein method,
bounds for the total variation norm between the distribution of the number of
balls drawn with a given color and the Poisson distribution with the same mean
are obtained. We then show that the distribution of the number of balls drawn
with a given color has the same tail as that of the original number of balls.
We finally establish explicit bounds between the two distributions when each
ball is drawn with fixed probability $p$.



1. Introduction 

We consider in this paper the following urn and ball scheme with replacement: 
An urn contains a random number of balls with different colors. We draw a small 
fraction $p \ll 1$ of the total number of balls. A ball which has been drawn is
replaced into the urn. The problem considered in this paper consists of estimating 
the number of colors together with the distribution of the number of balls with a 
given color by using information from sampled balls. This problem is motivated by 
the analysis of packet sampling in the Internet (see Chabchoub et al. [5] for details). 

To address the above problem, we analyze the non-normalized distribution of
the number of balls drawn with a given color. More specifically, let $W_j$
(respectively, $W_j^+$) denote the number of colors with a number of sampled
balls equal to (respectively, equal to or greater than) $j$. Denoting by
$\tilde K$ the number of colors seen when drawing balls, the quantities
$W_j/\tilde K$ and $W_j^+/\tilde K$ are equal to the proportions of colors
which, at the end of the trial, comprise exactly or at least $j$ balls,
respectively.

The numbers of balls with various colors are assumed to be i.i.d. random vari- 
ables and the number K of colors is large. In addition, the distribution of the 
number of balls with a given color has a heavy tailed probability distribution of 
Pareto or Weibull type. Finally, balls are drawn uniformly. This means that for
each $i = 1, \ldots, K$, if there are $v_i$ balls with color $i$, the probability of drawing a
ball with this color is $v_i/V$, where $V = v_1 + \cdots + v_K$ is the total number of balls
in the urn.
The above model is referred to as the "uniform model". It will be compared against
the case when balls are drawn independently of each other with probability $p$.
This model will be referred to as the "probabilistic model". We show that the results
obtained in both models are close to each other when $p$ is very small. There are,
however, some subtle differences between the two models, notably with regard to
the achievable accuracy in the inference of the original statistics. It turns out that the
probabilistic model is simpler to analyze than the uniform model but yields less
accurate results, because in that model we cannot exploit the fact that the
number of colors is very large.

One of the main results of the paper concerns the validity of the
following simple scaling rule: the distribution of the original number $v_i$ of balls
with color $i$ can be estimated by that of the random variable $\tilde v_i/p$, where $\tilde v_i$ is
the number of sampled balls with color $i$. When each ball is drawn with a fixed
probability, it is known that this rule is valid for the tails of the distributions as soon
as they are heavy tailed. See Asmussen et al. [3] and Foss and Korshunov [7], where
this asymptotic equivalence is proved in a quite general framework. Our main goal
here is to get, for $j \ge 2$, an explicit bound on the quantity

$\left| \frac{P(\tilde v \ge j)}{P(v \ge j/p)} - 1 \right|.$

In the context of packet sampling in the Internet, explicit expressions are especially
important for the estimation of the sizes of flows in Internet traffic. In this setting
the variable $j$ is taken to be large, but it cannot be too large, so that the event $\{\tilde v = j\}$
occurs sufficiently often to obtain reliable statistics. Hence, the dependence
on $j$ should be made explicit. See Chabchoub et al. [5] for a discussion.

The organization of this paper is as follows: The notation and the basic results
used in this paper (Le Cam's inequality and the Chen-Stein method) are presented in
Section 2. The mean values of the random variables $W_j$ and $W_j^+$ are computed in
Section 3. The approximation of the distribution of $W_j^+$ by a Poisson distribution
and the validity of the scaling rule are investigated in Section 4. We compare in
Section 5 the original distribution of the number of balls with a given color against
the rescaled distribution of the number of drawn balls with the same color. Some
concluding remarks with regard to sampling are presented in Section 6.

2. Notation and basic results 

2.1. Definitions and assumptions. We consider an urn containing $v_i$ balls with
color $i$ for $i = 1, \ldots, K$. The quantities $v_i$ are independent random variables with
a common heavy-tailed distribution. In the following we shall consider two families
of heavy-tailed distributions for the number $v$ of balls with a given color:

Pareto distributions: The distribution of $v$ is given by

(1)  $P(v \ge x) = (b/x)^a, \quad x \ge b,$

with the shape parameter $a > 1$ and the location parameter $b > 0$. The
mean of $v$ is $ab/(a-1)$.
Weibull distributions: The distribution of $v$ is given by

(2)  $P(v \ge x) = \exp(-(x/\eta)^\beta), \quad x \ge 0,$

with the skew parameter $\beta \in (0, 1)$ and the scale parameter $\eta > 0$. The
mean of $v$ is $\frac{\eta}{\beta}\Gamma(1/\beta)$, where $\Gamma$ is the classical Euler Gamma function.
The total number of balls in the urn is $V = \sum_{i=1}^{K} v_i$. We draw only a fraction
$p$ of this total number of balls. Each ball is drawn at random: a ball with color
$i$ is drawn with probability $v_i/V$. After drawing the $pV$ balls, we have $\tilde v_i$ balls
with color $i$. Of course, only those colors with $\tilde v_i > 0$ can be seen. The quantity
$\tilde K = \sum_{i=1}^{K} 1_{\{\tilde v_i > 0\}}$ is the number of colors seen at the end of a trial.

In the following, we shall be interested in the asymptotic regime when the number
of colors $K \to +\infty$ while the fraction $p \to 0$. Note that by the law of large numbers,
$V \to \infty$ a.s. (the total number of balls in the urn is very large).

The random variables we consider in this paper to infer the original statistics
of the number of balls and colors are the variables $W_j$ and $W_j^+$, $j \ge 0$, defined as
follows.

Definition 1 (Definition of $W_j$). The random variable $W_j$ is the number of colors
with $j$ balls at the end of a trial and is given by

$\forall j \ge 0, \quad W_j = 1_{\{\tilde v_1 = j\}} + 1_{\{\tilde v_2 = j\}} + \cdots + 1_{\{\tilde v_K = j\}},$

where $\tilde v_i \ge 0$ is the number of balls drawn with color $i$ (which can be equal to 0).

Definition 2 (Definition of $W_j^+$). The random variable $W_j^+$ is the number of colors
with at least $j$ balls at the end of a trial. The random variables $W_j^+$ are formally
defined by

$\forall j \ge 0, \quad W_j^+ = 1_{\{\tilde v_1 \ge j\}} + 1_{\{\tilde v_2 \ge j\}} + \cdots + 1_{\{\tilde v_K \ge j\}}.$

Note that we have

$\forall j \ge 0, \quad W_j^+ = \sum_{\ell \ge j} W_\ell.$

The averages of the random variables Wj are in fact the key quantities we shall 
use in the following to infer the original numbers of balls per color. 
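
The definitions above can be illustrated by a small simulation of the uniform model. This is only a sketch under stated assumptions: the function names (`sample_uniform_model`, `w_exact`, `w_at_least`), the number of colors, the sampling rate and the integer-valued Pareto-like environment are illustrative choices, not taken from the paper.

```python
import random
from collections import Counter

def sample_uniform_model(v, p, rng):
    """Draw round(p * V) balls with replacement; color i is drawn with probability v[i] / V."""
    total = sum(v)
    n_draws = round(p * total)
    drawn = rng.choices(range(len(v)), weights=v, k=n_draws)
    counts = Counter(drawn)
    # sampled[i] plays the role of the number of sampled balls with color i
    return [counts.get(i, 0) for i in range(len(v))]

def w_exact(sampled, j):
    """W_j: number of colors with exactly j sampled balls."""
    return sum(1 for x in sampled if x == j)

def w_at_least(sampled, j):
    """W_j^+: number of colors with at least j sampled balls."""
    return sum(1 for x in sampled if x >= j)

rng = random.Random(0)          # fixed seed, for reproducibility only
p = 0.1                         # hypothetical sampling rate
# Hypothetical environment: integer parts of Pareto variates with shape 1.5 and b = 1.
v = [int(rng.paretovariate(1.5)) for _ in range(2000)]
sampled = sample_uniform_model(v, p, rng)
seen_colors = w_at_least(sampled, 1)   # number of colors seen at the end of the trial
```

Here `w_at_least(sampled, j)` coincides with the sum of `w_exact(sampled, l)` over all `l >= j`, which is the identity stated after Definition 2.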

2.2. Le Cam's inequality and Chen-Stein method. Le Cam's inequality bounds
the distance in total variation between the distribution of a sum of independent and
identically distributed (i.i.d.) Bernoulli random variables and the Poisson distribution
with the same mean (see Barbour et al. [4]). Note that if $V$ and $W$ are
two random variables taking integer values, the distance in total variation between
their distributions is defined by

$\|P(W \in \cdot) - P(V \in \cdot)\|_{tv} \stackrel{\text{def}}{=} \sup_{A \subset \mathbb{N}} |P(W \in A) - P(V \in A)| = \frac{1}{2} \sum_{n \in \mathbb{N}} |P(W = n) - P(V = n)|.$

Theorem 1 (Le Cam's Inequality). If the random variable $W = \sum_i I_i$, where the
random variables $I_i$ are i.i.d. Bernoulli random variables, then

(3)  $\|P(W \in \cdot) - P(Q_{E(W)} \in \cdot)\|_{tv} \le \sum_i E(I_i)^2,$

where for $\lambda > 0$, $Q_\lambda$ is a Poisson random variable with mean $\lambda$, that is, for all
$n \ge 0$,

$P(Q_\lambda = n) = \frac{\lambda^n}{n!} e^{-\lambda}.$
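
As a numerical illustration of Le Cam's inequality, the following sketch computes the total variation distance between a binomial distribution (a sum of $n$ i.i.d. Bernoulli variables with mean $q$) and the Poisson distribution with mean $nq$, and compares it with the bound $n q^2$. The helper names and the values $n = 200$, $q = 0.01$ are assumptions chosen for the example.

```python
import math

def binom_pmf(n, q, k):
    """P(W = k) for W a sum of n i.i.d. Bernoulli(q) variables."""
    return math.comb(n, k) * q**k * (1 - q) ** (n - k)

def poisson_pmf(lam, k):
    """P(Q_lam = k), computed in log space for numerical stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_binomial_poisson(n, q, extra=60):
    """Half the l1 distance between Binomial(n, q) and Poisson(n q).

    The sum is truncated at n + extra; the Poisson mass beyond that point
    is negligible for the small means used in this example.
    """
    lam = n * q
    total = 0.0
    for k in range(n + extra + 1):
        b = binom_pmf(n, q, k) if k <= n else 0.0
        total += abs(b - poisson_pmf(lam, k))
    return 0.5 * total

n, q = 200, 0.01
tv = tv_binomial_poisson(n, q)
le_cam_bound = n * q**2     # right-hand side of Le Cam's inequality
```

For these values the computed distance `tv` stays below `le_cam_bound`, as the inequality requires.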
When the random variables $I_i$ appearing in the above theorem are not independent
but satisfy a specific condition, referred to as monotonic coupling, it is still
possible to obtain a bound on the distance between the distribution of the sum
$W = \sum_i I_i$ and the Poisson distribution with mean $E(W)$.

Definition 3 (Monotonic Coupling). The variables $I_i$ are said to be negatively
related when there exist some random variables $U_i$ and $V_i$ such that

(1) $U_i \stackrel{\text{dist}}{=} W$ and $1 + V_i \stackrel{\text{dist}}{=} (W \mid I_i = 1)$;
(2) $V_i \le U_i$.

The main result of the Chen-Stein method is given by the following theorem (see
Barbour et al. [4]).

Theorem 2. If the monotonic coupling condition is satisfied, then the following
inequality holds

(4)  $\|P(W \in \cdot) - P(Q_{E(W)} \in \cdot)\|_{tv} \le 1 - \frac{\mathrm{Var}(W)}{E(W)}.$

When the monotonic coupling condition is satisfied, in order to prove the Poisson
approximation, it is sufficient to show that the ratio of the variance to the mean
value of $W$ is close to 1; this is a very weak condition to prove in practice.

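It should be noted (see [8]) that Relation (4) can be used not only when $E(W)$
takes bounded values so that $W$ is approximately a Poisson random variable, but
also when $E(W)$ is large. In this case the Chen-Stein method yields a central limit
theorem: if $N$ is a standard normal random variable, then

$\left\| P\left(\frac{W - E(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot\right) - P(N \in \cdot) \right\|_{tv} \le \left\| P\left(\frac{W - E(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot\right) - P\left(\frac{Q_{E(W)} - E(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot\right) \right\|_{tv} + \left\| P\left(\frac{Q_{E(W)} - E(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot\right) - P(N \in \cdot) \right\|_{tv},$

where $\mathrm{Var}(W)$ is the variance of the random variable $W$.
By using Relation (4), we have

$\left\| P\left(\frac{W - E(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot\right) - P(N \in \cdot) \right\|_{tv} \le 1 - \frac{\mathrm{Var}(W)}{E(W)} + \left\| P\left(\frac{Q_{E(W)} - E(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot\right) - P(N \in \cdot) \right\|_{tv}.$

If the ratio $E(W)/\mathrm{Var}(W)$ is close to 1, then the first term in the right hand side of
the above relation is negligible. In addition, the classical central limit theorem for
Poisson distributions implies that when $E(W)$ is large, the second term is negligible
too. Therefore, we have $W \approx E(W) + \sqrt{\mathrm{Var}(W)}\, N$ with a bound on the error.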
3. Computation of mean values

3.1. Bounds for mean values. By using Le Cam's inequality, we can establish
the following result for the mean value of the random variables $W_j$.

Proposition 1 (Mean Value of $W_j$). If there are $V$ balls and $K$ colors in the urn,
for $j \ge 0$, the mean number $E(W_j)$ of colors with $j$ balls at the end of a trial satisfies
the relation

(5)  $\left| \frac{E(W_j)}{K} - Q_j \right| \le E\left( \min(pv, 1) \frac{v}{V} \right),$

where $Q$ is the probability distribution defined for $j \ge 0$ by

$Q_j = E\left( \frac{(pv)^j}{j!} e^{-pv} \right),$

$p$ is the sampling rate, and $v$ is distributed as the number of balls with a given color.
Proof. We have

$\tilde v_i = B_1^i + B_2^i + \cdots + B_{pV}^i,$

where $B_\ell^i$ is equal to one if the $\ell$th ball drawn from the urn has color $i$, which event
occurs with probability $v_i/V$, the quantity $V$ being the total number of balls in the
urn.

Conditionally on the values of the set $\mathcal{F} = \{v_1, \ldots, v_K\}$, the variables $(B_\ell^i, \ell \ge 1)$
are independent Bernoulli variables. For $1 \le i \le K$, Le Cam's Inequality (3)
therefore gives the relation

$\|P(\tilde v_i \in \cdot \mid \mathcal{F}) - P(Q_{p v_i} \in \cdot \mid \mathcal{F})\|_{tv} \le pV \left(\frac{v_i}{V}\right)^2 = p \frac{v_i^2}{V},$

and Relation (4), which can also be used in this case, yields

$\|P(\tilde v_i \in \cdot \mid \mathcal{F}) - P(Q_{p v_i} \in \cdot \mid \mathcal{F})\|_{tv} \le \frac{v_i}{V}.$

By integrating with respect to the variables $v_1, \ldots, v_K$, these two inequalities give
the relation

(6)  $\|P(\tilde v_i \in \cdot) - Q\|_{tv} \le E\left(\min(pv, 1) \frac{v}{V}\right).$

Since $E(W_j) = \sum_{i=1}^{K} P(\tilde v_i = j)$, by summing on $i = 1, \ldots, K$, we obtain

$|E(W_j) - K Q_j| \le K E\left(\min(pv, 1) \frac{v}{V}\right)$

and the result follows. □

By using the fact that $E(W_j^+) = \sum_{i=1}^{K} P(\tilde v_i \ge j)$, we can deduce from
Equation (6) the following result.

Proposition 2 (Mean Value of $W_j^+$). If there are $V$ balls and $K$ colors in the
urn, the mean number $E(W_j^+)$ of colors with at least $j \ge 0$ balls at the end of an
arbitrary trial satisfies the relation

(7)  $\left| \frac{E(W_j^+)}{K} - \sum_{\ell \ge j} Q_\ell \right| \le E\left(\min(pv, 1) \frac{v}{V}\right),$

where the probability distribution $Q$ is defined in Proposition 1.


6 CHRISTINE FRICKER, FABRICE GUILLEMIN, AND PHILIPPE ROBERT 

We immediately deduce from Propositions 1 and 2 the following corollary by
using the fact that $V \ge K$.

Corollary 1 (Asymptotic Mean Values). The relations

$\lim_{K \to +\infty} \frac{1}{K} E(W_j) = Q_j$ and $\lim_{K \to +\infty} \frac{1}{K} E(W_j^+) = \sum_{\ell \ge j} Q_\ell$

hold.

Note that if balls are drawn with probability $p$ independently of each other
(probabilistic model), we have $\tilde v_i = \sum_{\ell=1}^{v_i} B_\ell^i$, where the random variables $B_\ell^i$ are
Bernoulli with mean $p$. By adapting the above proofs, we find

(8)  $\left| \frac{E(W_j)}{K} - Q_j \right| \le p.$



3.2. Asymptotic results for specific probability distributions. 



3.2.1. Pareto distributions. Let us first assume that the number of balls of a given
color follows a Pareto distribution given by Equation (1). Then, we have the following
result when the number of colors goes to infinity.

Proposition 3. If $v$ has a Pareto distribution as in Equation (1), then for all
$j > a$, the relations

(9)  $\lim_{K \to +\infty} \frac{E(W_{j+1})}{E(W_j)} = \frac{j - a}{j + 1} + O((pb)^{j-a}),$

(10)  $\lim_{K \to +\infty} \frac{E(W_j)}{K} = a (pb)^a \frac{\Gamma(j-a)}{j!} \left(1 + O((pb)^{j-a})\right),$

(11)  $\lim_{K \to +\infty} \frac{E(W_j^+)}{K} = (pb)^a \frac{\Gamma(j-a)}{\Gamma(j)} + O((pb)^{j})$

hold.

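Proof. For $j > a$,

(12)  $Q_j = E\left(\frac{(pv)^j}{j!} e^{-pv}\right) = a b^a \frac{p^j}{j!} \int_b^{+\infty} x^{j-a-1} e^{-px}\, dx = a \frac{(pb)^a}{j!} \int_{pb}^{+\infty} u^{j-a-1} e^{-u}\, du = a (pb)^a \frac{\Gamma(j-a)}{j!} - a \frac{(pb)^a}{j!} \int_0^{pb} u^{j-a-1} e^{-u}\, du.$

Therefore, by using the relation $\Gamma(x + 1) = x\Gamma(x)$, we get the equivalence

$\frac{Q_{j+1}}{Q_j} = \frac{j - a}{j + 1} + O((pb)^{j-a}),$

which gives Equations (9) and (10) by using Corollary 1. For the mean value of
$W_j^+$, Equation (7) gives the relation

$\lim_{K \to +\infty} \frac{E(W_j^+)}{K} = a (pb)^a \sum_{n \ge j} \frac{\Gamma(n-a)}{n!} + O\left(\frac{(pb)^j}{1 - pb}\right)$
$= a (pb)^a \sum_{n \ge 0} \frac{\Gamma(j+n-a)\,\Gamma(n+1)}{\Gamma(j+n+1)} \frac{1}{n!} + O\left(\frac{(pb)^j}{1 - pb}\right)$
$= a (pb)^a \frac{\Gamma(j-a)}{j!} F(j-a, 1; j+1; 1) + O\left(\frac{(pb)^j}{1 - pb}\right),$

where $F(a, b; c; z)$ is the hypergeometric function satisfying

$F(a, b; c; 1) = \frac{\Gamma(c)\Gamma(c-a-b)}{\Gamma(c-a)\Gamma(c-b)}$

(see Abramowitz and Stegun [1]), and Equation (11) follows. □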
The shape parameter $a$ can be estimated via Relation (11) by

(13)  $a = \lim_{K \to \infty} j \left(1 - \frac{E(W_{j+1}^+)}{E(W_j^+)}\right) + O\left(\frac{(pb)^{j-a}}{1 - pb}\right)$

for all $j > a$. When observing drawn balls, we have in fact only access to the mean
number $E(\tilde K)$ of sampled colors. While this has no impact on the estimation of $a$, this correcting
term is important when estimating $b$ from Equation (11). It is straightforward that

$E(\tilde K) = \sum_{i=1}^{K} P(\tilde v_i > 0)$

and then when $K \to \infty$

$E(\tilde K) \sim K(1 - Q_0) = K(1 - E(e^{-pv})).$

Since

(14)  $1 - E(e^{-pv}) = p \int_0^{\infty} e^{-px} P(v \ge x)\, dx = bp + (bp)^a \Gamma(1 - a, bp),$

where $\Gamma(a, x)$ is the incomplete Gamma function defined by $\Gamma(a, x) = \int_x^{\infty} t^{a-1} e^{-t}\, dt$,
we can use the above equations together with Equation (11) in order to estimate $b$
and then $K$. It is also worth noting that $1 - E(e^{-pv}) \sim bp$ when $a > 1$ and $bp \to 0$.
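
The estimation of the shape parameter can be sketched numerically. For a Pareto distribution the limiting ratio $Q_{j+1}/Q_j$ is close to $(j-a)/(j+1)$ when $pb$ is small, so that $a \approx j - (j+1)\,Q_{j+1}/Q_j$. The code below is an illustration under assumptions: it evaluates $Q_j = a (pb)^a \Gamma(j-a, pb)/j!$ for a continuous Pareto law, the helper names are hypothetical, and the limiting values $Q_j$ stand in for the empirical means $E(W_j)/K$ that would be measured in practice.

```python
import math

def lower_gamma(s, x, terms=60):
    """Lower incomplete gamma function via its power series (s > 0, small x)."""
    return x**s * sum((-x) ** k / (math.factorial(k) * (s + k)) for k in range(terms))

def pareto_q(j, a, pb):
    """Q_j = a (pb)^a Gamma(j - a, pb) / j! for j > a (continuous Pareto assumption)."""
    upper = math.gamma(j - a) - lower_gamma(j - a, pb)
    return a * pb**a * upper / math.factorial(j)

def shape_from_ratio(j, ratio):
    """Invert ratio ~ (j - a) / (j + 1) to recover the shape parameter a."""
    return j - (j + 1) * ratio

a, pb, j = 1.5, 0.01, 3          # hypothetical parameter values
ratio = pareto_q(j + 1, a, pb) / pareto_q(j, a, pb)
a_hat = shape_from_ratio(j, ratio)
```

With these values `a_hat` differs from `a` only by a term of order $(pb)^{j-a}$; taking a larger `j` reduces the discrepancy further.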

3.2.2. Weibull distributions. We assume in this section that the number of balls
with a given color follows a Weibull distribution. In this case, we have the following
result, which follows from a simple change of variable and the expansion of $\exp(-x^\beta)$
in power series of $x^\beta$ or of $\exp(-px)$ in power series of $x$; the proof is omitted.

Proposition 4. If $v$ has a Weibull distribution with skew parameter $\beta$ and scale
parameter $\eta$, then for $0 < \beta < 1$

(15)  $\lim_{K \to +\infty} \frac{E(W_j)}{K} = \frac{\beta}{j!} \sum_{n=0}^{\infty} \frac{(-1)^n}{n!} \frac{\Gamma((n+1)\beta + j)}{(p\eta)^{(n+1)\beta}}$

and for $\beta > 1$,

(16)  $\lim_{K \to +\infty} \frac{E(W_j)}{K} = \frac{1}{j!} \sum_{n=0}^{\infty} \frac{(-1)^n (p\eta)^{n+j}}{n!} \Gamma\left(\frac{n+j}{\beta} + 1\right).$

Note that the limit of $E(W_j)/K$ can be written in the form

$\lim_{K \to +\infty} \frac{E(W_j)}{K} = \frac{\beta}{j!\,(p\eta)^\beta} \int_0^{+\infty} u^{j+\beta-1} e^{-u + t u^\beta}\, du$

with $t = -1/(p\eta)^\beta$. The above integral is known in the literature as an integral of
Faxén's type and can be expressed by means of the Meijer G-function when $\beta$ is a
rational number; see Abramowitz and Stegun [1].

Contrary to the case of a Pareto distribution for the initial distribution of balls of
a given color, there is no simple relation giving the parameters $\beta$ and $\eta$ from the
mean values $E(W_j)$, $j \ge 1$. In fact, we shall prove in the following that $P(\tilde v \ge j)$
also has a Weibull tail. This eventually gives a means of identifying the parameters.

4. Poisson approximations

In the previous section, we have established bounds for the mean values of the
random variables $W_j$ and $W_j^+$. To obtain more information on their distributions,
we intend to use the Chen-Stein method. For a fixed environment (namely fixed values
of the quantities $v_i$ for $i = 1, \ldots, K$), these random variables appear as sums
of non-independent Bernoulli random variables. A preliminary analysis of the
Bernoulli random variables appearing in the expression of $W_j$ suggests that it is
not possible to invoke a monotonic coupling argument. It is well known (see [4]
for details) that the situation is more favorable with the random variables $W_j^+$, and
we can specifically prove that if $\mathcal{F}$ is the set $\mathcal{F} = \{v_i, 1 \le i \le K\}$, then the total
number $W_j^+$ of colors with at least $j$ balls at the end of the trial satisfies the relation

(17)  $\|P(W_j^+ \in \cdot \mid \mathcal{F}) - P(Q_{E(W_j^+ \mid \mathcal{F})} \in \cdot)\|_{tv} \le 1 - \frac{\mathrm{Var}(W_j^+ \mid \mathcal{F})}{E(W_j^+ \mid \mathcal{F})}.$

Indeed, given the random variables $v_i$, the model is equivalent to a standard urn
and ball problem consisting of putting $pV$ balls into $K$ urns, a ball falling into
urn $i$ with probability $p_i = v_i/V$. The number of balls in urn $i$ is the number of
sampled balls with color $i$ in the original urn and ball problem. Even in the case when the
quantities $p_i$ are different, the variables $I_i = 1_{\{\tilde v_i \ge j\}}$ are negatively related so
that Theorem 2 can be used. See Page 24 and Corollary 2.C.2 Page 26 of [4] for a
definition and the main inequality in this domain. Chapter 6 of this reference is
entirely devoted to related occupancy problems.

The rest of this section is devoted to the estimation of the bound in
Equation (17). We first establish the following lemma.

Lemma 1. For a fixed environment $\mathcal{F} = \{v_i, 1 \le i \le K\}$, the distance in total
variation between the distribution of $W_j^+$ and the Poisson distribution $Q_{E(W_j^+ \mid \mathcal{F})}$
satisfies the inequality

(18)  $\lim_{K \to +\infty} \|P(W_j^+ \in \cdot \mid \mathcal{F}) - P(Q_{E(W_j^+ \mid \mathcal{F})} \in \cdot)\|_{tv} \le \frac{m_{2,j}(p)}{m_j(p)} + \frac{p}{E(v)} \frac{(m_j'(p))^2}{m_j(p)},$

where $m_j(p)$ and $m_{2,j}(p)$ are the first two moments of the random variable defined
by

(19)  $X_j(p) = \sum_{\ell \ge j} \frac{(pv)^\ell}{\ell!} e^{-pv}$

and the prime sign denotes the derivative with respect to $p$.

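Proof. For $\mathcal{F}$ fixed, the number $W_j$ of colors with $j \le pV$ balls at the end of the
trial is such that

$E(W_j \mid \mathcal{F}) = \sum_{i=1}^{K} \binom{pV}{j} \left(\frac{v_i}{V}\right)^j \left(1 - \frac{v_i}{V}\right)^{pV - j}.$

By using the fact that

$\frac{1}{V} = \frac{1}{K E(v)} + o\left(\frac{1}{K}\right)$ a.s.

for large $K$, straightforward calculations show that

(20)  $\binom{pV}{j} \left(\frac{v_i}{V}\right)^j \left(1 - \frac{v_i}{V}\right)^{pV - j} = \frac{(pv_i)^j}{j!} e^{-pv_i} \left(1 - \frac{j(j-1)}{2pKE(v)} + \frac{j v_i}{E(v)K} - \frac{p v_i^2}{2E(v)K}\right) + o\left(\frac{1}{K}\right)$
$= \frac{(pv_i)^j}{j!} e^{-pv_i} - \frac{p}{2E(v)K} \frac{d^2}{dp^2}\left(\frac{(pv_i)^j}{j!} e^{-pv_i}\right) + o\left(\frac{1}{K}\right).$

By summing up the terms above and by checking that the $o(1/K)$ term remains valid,
since the sum can be written as $\sum_{i=1}^{K} f(v_i) e^{-pv_i}/K^2$, where $f$ is a polynomial, we
have for $j \ge 1$ and $0 < p < 1$

$E(W_j^+ \mid \mathcal{F}) = \sum_{\ell \ge j} E(W_\ell \mid \mathcal{F}) = \sum_{i=1}^{K} X_{i,j}(p) - \frac{p}{2E(v)K} \sum_{i=1}^{K} X_{i,j}''(p) + o(1),$

where

$X_{i,j}(x) = \sum_{\ell \ge j} \frac{(x v_i)^\ell}{\ell!} e^{-x v_i}.$

For the variance, if $I_{i,j}$ is 1 if color $i$ has exactly $j$ balls at the end of the trial
and 0 otherwise, then $W_j = \sum_{i=1}^{K} I_{i,j}$ and, for $j \ne \ell$,

$E(W_j W_\ell \mid \mathcal{F}) = \sum_{1 \le i \ne m \le K} E(I_{i,j} I_{m,\ell} \mid \mathcal{F})$

and

$E(W_j^2 \mid \mathcal{F}) = E(W_j \mid \mathcal{F}) + \sum_{1 \le i \ne m \le K} E(I_{i,j} I_{m,j} \mid \mathcal{F}).$

For $j$, $\ell$ such that $j + \ell \le pV$,

$E(I_{i,j} I_{m,\ell} \mid \mathcal{F}) = \frac{(pV)!}{j!\,\ell!\,(pV - j - \ell)!} \left(\frac{v_i}{V}\right)^j \left(\frac{v_m}{V}\right)^\ell \left(1 - \frac{v_i + v_m}{V}\right)^{pV - j - \ell}.$

The quantity in the right hand side of the above equation can be expanded as

$e^{-p(v_i + v_m)} p^{j+\ell} \frac{v_i^j v_m^\ell}{j!\,\ell!} - \frac{p}{2V} e^{-p(v_i + v_m)} \frac{v_i^j v_m^\ell}{j!\,\ell!} c_{i,m}(j, \ell) + o\left(\frac{1}{K}\right),$

where

$c_{i,m}(j, \ell) = p^{j+\ell-2} (j + \ell)(j + \ell - 1) - 2(j + \ell)(v_i + v_m) p^{j+\ell-1} + (v_i + v_m)^2 p^{j+\ell}$

is such that

$\frac{v_i^j v_m^\ell}{j!\,\ell!} e^{-p(v_i + v_m)} c_{i,m}(j, \ell) = \frac{d^2}{dp^2}\left(\frac{(pv_i)^j}{j!} \frac{(pv_m)^\ell}{\ell!} e^{-p(v_i + v_m)}\right).$

Since

$(W_j^+)^2 = \left(\sum_{\ell \ge j} W_\ell\right)^2 = \sum_{\ell \ne k \ge j} W_k W_\ell + \sum_{\ell \ge j} W_\ell^2,$

$E((W_j^+)^2 \mid \mathcal{F}) - E(W_j^+ \mid \mathcal{F}) = \sum_{1 \le i \ne m \le K} \sum_{\ell, k \ge j} E(I_{i,k} I_{m,\ell} \mid \mathcal{F}) = \sum_{1 \le i \ne m \le K} \left( X_{i,j}(p) X_{m,j}(p) - \frac{p}{2V} (X_{i,j} X_{m,j})''(p) \right) + o(K)$

and

$1 - \frac{\mathrm{Var}(W_j^+ \mid \mathcal{F})}{E(W_j^+ \mid \mathcal{F})} = \frac{E(W_j^+ \mid \mathcal{F}) - E((W_j^+)^2 \mid \mathcal{F}) + E(W_j^+ \mid \mathcal{F})^2}{E(W_j^+ \mid \mathcal{F})}.$

The right-hand side of this equation can be expanded as

$\frac{1}{\sum_{i=1}^{K} X_{i,j}(p) + O(1)} \left( \left(\sum_{i=1}^{K} X_{i,j}(p)\right)^2 - \sum_{1 \le i \ne m \le K} X_{i,j}(p) X_{m,j}(p) + \frac{p}{2V} \left[ \sum_{1 \le i \ne m \le K} (X_{i,j} X_{m,j})''(p) - 2 \sum_{i=1}^{K} X_{i,j}(p) \sum_{i=1}^{K} X_{i,j}''(p) \right] \right) + o(1),$

which can be rewritten as

$\frac{1}{\sum_{i=1}^{K} X_{i,j}(p) + O(1)} \left( \sum_{i=1}^{K} X_{i,j}^2(p) + \frac{p}{2V} \left[ \sum_{1 \le i \ne m \le K} (X_{i,j} X_{m,j})''(p) - 2 \sum_{i=1}^{K} X_{i,j}(p) \sum_{i=1}^{K} X_{i,j}''(p) \right] \right) + o(1),$

using that

$\sum_{i \ne m} X_{i,j} X_{m,j} = \left(\sum_i X_{i,j}\right)^2 - \sum_i X_{i,j}^2.$

By the law of large numbers, we have that, almost surely,

$\lim_{K \to +\infty} \frac{1}{K} \sum_{i=1}^{K} X_{i,j}^2(p) = E(X_j^2(p)) = m_{2,j}(p), \qquad \lim_{K \to +\infty} \frac{1}{K} \sum_{i=1}^{K} X_{i,j}(p) = m_j(p),$

together with

$\lim_{K \to +\infty} \frac{1}{K} \sum_{i=1}^{K} X_{i,j}''(p) = m_j''(p).$

Hence,

$\lim_{K \to +\infty} \left(1 - \frac{\mathrm{Var}(W_j^+ \mid \mathcal{F})}{E(W_j^+ \mid \mathcal{F})}\right) = \frac{m_{2,j}(p) + p\left[(m_j^2)''(p)/2 - m_j(p) m_j''(p)\right]/E(v)}{m_j(p)} = \frac{m_{2,j}(p) + p\, m_j'(p)^2/E(v)}{m_j(p)}$ a.s.

and the result follows. □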



To illustrate the fact that the bound in Equation (18) is tight when $p \to 0$ and
$v$ has finite moments of any order, let us note that, provided the corresponding
moments are finite,

(21)  $\lim_{p \to 0} \frac{m_j(p)}{p^j} = \frac{E(v^j)}{j!}$ and $\lim_{p \to 0} \frac{m_j'(p)}{p^{j-1}} = \frac{E(v^j)}{(j-1)!}.$

Moreover,

$\lim_{p \to 0} \frac{m_{2,j}(p)}{p^{2j}} = \frac{E(v^{2j})}{(j!)^2}.$

Thus, the limit when $K$ tends to $\infty$ of the bound given by Equation (18) is
equivalent to

$\frac{j p^{j-1}}{(j-1)!} \frac{E(v^j)}{E(v)}$

when $p$ tends to 0. If $j \ge 2$, this term tends to 0 when $p \to 0$.
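
The small-$p$ behavior of the bound can be checked numerically. The sketch below evaluates $m_{2,j}(p)/m_j(p) + p\,m_j'(p)^2/(E(v)\,m_j(p))$ for a finitely supported distribution of $v$ and compares it with $j p^{j-1} E(v^j)/((j-1)!\,E(v))$. The three-point distribution, the value of $p$ and the function names are assumptions made for the example; the derivative uses the identity $\frac{d}{dp} P(Q_{pv} \ge j) = v\,P(Q_{pv} = j-1)$.

```python
import math

def poisson_pmf(lam, k):
    """P(Q_lam = k) for lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_tail(lam, j, terms=80):
    """P(Q_lam >= j), summed directly to avoid cancellation when lam is small."""
    return sum(poisson_pmf(lam, l) for l in range(j, j + terms))

def lemma_bound(pmf, p, j):
    """m_{2,j}/m_j + p (m_j')^2 / (E(v) m_j) for v distributed according to pmf."""
    mean_v = sum(v * w for v, w in pmf.items())
    m1 = sum(w * poisson_tail(p * v, j) for v, w in pmf.items())
    m2 = sum(w * poisson_tail(p * v, j) ** 2 for v, w in pmf.items())
    dm1 = sum(w * v * poisson_pmf(p * v, j - 1) for v, w in pmf.items())
    return m2 / m1 + p * dm1**2 / (mean_v * m1)

def small_p_equivalent(pmf, p, j):
    """j p^(j-1) E(v^j) / ((j-1)! E(v))."""
    mean_v = sum(v * w for v, w in pmf.items())
    moment_j = sum(w * v**j for v, w in pmf.items())
    return j * p ** (j - 1) * moment_j / (math.factorial(j - 1) * mean_v)

pmf = {1: 0.5, 2: 0.3, 5: 0.2}   # hypothetical distribution of v
p, j = 1e-3, 2
bound = lemma_bound(pmf, p, j)
equivalent = small_p_equivalent(pmf, p, j)
ratio = bound / equivalent
```

For this choice the ratio is close to 1, and the bound itself is of order $p^{j-1}$, hence small for $j \ge 2$.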

By using the above lemma, we are now able to state a limit result for the
distribution of the random variables $W_j^+$.

Proposition 5. The inequality

(22)  $\limsup_{K \to +\infty} \sup_{x} \left| P\left(\frac{W_j^+ - E(W_j^+)}{\sqrt{E(W_j^+)}} \le x\right) - \int_{-\infty}^{x} e^{-u^2/2} \frac{du}{\sqrt{2\pi}} \right| \le \frac{m_{2,j}(p)}{m_j(p)} + \frac{p}{E(v)} \frac{(m_j'(p))^2}{m_j(p)}$

holds.

Thus, for $j \ge 2$ and for small $p$, this gives the following approximation

$W_j^+ \approx E(W_j^+) + \sqrt{E(W_j^+)}\, G,$

where $G$ is a standard normal random variable. It should be noted nevertheless
that Equation (22) is almost a central limit result, but because of the scaling in
$1/\sqrt{E(W_j^+)}$ instead of $1/\sqrt{\mathrm{Var}(W_j^+)}$, the bound in the right hand side is not 0 as
$K$ gets large but, according to the proof of Lemma 1, only an upper bound on the
distance between $E(W_j^+)$ and $\mathrm{Var}(W_j^+)$.
Proof. From Lemma 1 we have

$\lim_{K \to +\infty} \|P(W_j^+ \in \cdot \mid \mathcal{F}) - P(Q_{E(W_j^+ \mid \mathcal{F})} \in \cdot \mid \mathcal{F})\|_{tv} \le \frac{m_{2,j}(p)}{m_j(p)} + \frac{p}{E(v)} \frac{m_j'(p)^2}{m_j(p)}.$

From Equation (20), we have that

$\lim_{K \to +\infty} \frac{1}{K} E(W_j^+ \mid \mathcal{F}) = E(X_j(p)) = \sum_{\ell \ge j} Q_\ell = m_j(p),$

where the quantities $Q_\ell$ are defined in Proposition 1. In addition, from Corollary 1,
$E(W_j^+) \sim K m_j(p)$ when $K \to +\infty$. The result then follows by applying the central
limit theorem for Poisson distributions and by deconditioning with respect to $\mathcal{F}$. □

To conclude this section, let us notice that when balls are drawn with probability
$p$ independently of each other, we do not have to condition on the environment and
we have

$\|P(W_j^+ \in \cdot) - P(Q_{E(W_j^+)} \in \cdot)\|_{tv} \le \frac{\left( E\left( \sum_{k \ge j} \binom{v}{k} p^k (1-p)^{v-k} 1_{\{v \ge k\}} \right) \right)^2}{E\left( \sum_{k \ge j} \binom{v}{k} p^k (1-p)^{v-k} 1_{\{v \ge k\}} \right)}.$

It is worth noting that the results are independent of the number of colors and
that we do not need to take $K \to \infty$ to obtain a bound for the distance in total
variation. In addition, when $E(W_j^+)$ becomes large, it is possible to obtain a
central limit-type approximation similar to Proposition 5.



5. Comparison with original distributions 

5.1. Uniform model. In this section, we compare the distribution of the number
$\tilde v$ of balls drawn with a given color with that of the original number $v$ of balls with
a given color. We are in particular interested in giving a sense to the heuristic
stating that $v$ and $\tilde v/p$ have distributions close to each other.

Proposition 6. Under the condition that the random variable $v$ has a Weibull or
Pareto distribution, we have

$\lim_{j \to \infty} \lim_{K \to \infty} \frac{E(W_j^+)}{K P(v \ge j/p)} = 1.$

Proof. From Corollary 1 we know that $E(W_j)/K \to Q_j$ when $K \to \infty$. Since

$Q_j = E\left(\frac{(pv)^j}{j!} e^{-pv}\right) = \sum_{\ell \ge 0} \frac{(p\ell)^j}{j!} e^{-p\ell} P(v = \ell),$

we can show that if $v$ has a Weibull or Pareto distribution, then $Q_j \sim P(v = j/p)/p$
when $j \to \infty$. Indeed, the above sum can be rewritten as

$Q_j = \frac{1}{j!} \sum_{\ell \ge 0} e^{f_j(\ell)} P(v = \ell),$

where $f_j(\ell) = -p\ell + j \log(p\ell)$, which attains its maximum at point $j/p$ with
$f_j''(j/p) = -p^2/j$. If the random variable $v$ is Weibull or Pareto and $j/p$ is
sufficiently large, then $P(v = \ell)/P(v = j/p) - 1 \to 0$ uniformly on $j$ for $\ell$ in the
neighborhood of $j/p$. It follows that

$Q_j \sim \frac{1}{j!} P(v = j/p)\, e^{f_j(j/p)} \sum_{\ell} e^{-\frac{p^2}{2j} (\ell - j/p)^2}.$

For $a > 0$ converging to 0,

$\sum_{\ell = -\infty}^{+\infty} e^{-a\ell^2} = \sum_{\ell = -\infty}^{+\infty} \int_0^{+\infty} 1_{\{u \ge a\ell^2\}} e^{-u}\, du \sim 2 \int_0^{+\infty} \sqrt{\frac{u}{a}}\, e^{-u}\, du = \sqrt{\frac{\pi}{a}},$

and by Stirling's formula $j! \sim \sqrt{2\pi j}\, j^j e^{-j}$ for large $j$, so that $Q_j \sim P(v = j/p)/p$.
It is then easy to deduce that $\sum_{\ell \ge j} Q_\ell \sim P(v \ge j/p)$ for large $j$. □
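
The equivalence $Q_j \sim P(v = j/p)/p$ used in the proof can be checked numerically. The sketch below assumes a discrete power-law mass function $P(v = \ell) \propto \ell^{-(a+1)}$, $\ell \ge 1$, as a stand-in for a Pareto distribution; the normalizing constant cancels in the ratio and is therefore omitted. The function names, the truncation level `l_max` and the parameter values are illustrative assumptions.

```python
import math

def q_j_unnormalised(j, p, a, l_max):
    """Sum over l of exp(f_j(l)) / j! * l^-(a+1), with f_j(l) = -p l + j log(p l)."""
    log_j_factorial = math.lgamma(j + 1)
    total = 0.0
    for l in range(1, l_max + 1):
        f = -p * l + j * math.log(p * l)
        total += math.exp(f - log_j_factorial) * l ** (-(a + 1))
    return total

def rescaled_point_mass(j, p, a):
    """P(v = j/p) / p for the same unnormalised power-law mass function."""
    return (j / p) ** (-(a + 1)) / p

j, p, a = 200, 0.1, 1.5          # hypothetical values; j/p = 2000
ratio = q_j_unnormalised(j, p, a, l_max=6000) / rescaled_point_mass(j, p, a)
```

The ratio is close to 1 for large `j`, the residual deviation being of order $1/j$.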

The above proposition implies that $P(\tilde v \ge j)$ is such that $P(\tilde v \ge j) \sim P(v \ge j/p)$
when the number of colors is large. This means that the tail of the distribution
of the random variable $v$ can be obtained by rescaling that of the number $\tilde v$ of
sampled balls with a given color. When $v$ has a Pareto distribution, Equation (13)
can still be used for large $j$ to estimate the shape parameter $a$. The
probability $1 - E(e^{-pv})$ of sampling a color and the scale parameter $b$ can also
be estimated from the tail by using the expression of that probability as a function
of $b$ and $a$ as in Equation (14). The same method applies for Weibull distributions.



5.2. Probabilistic model. From now on, we consider the probabilistic model and
we establish stronger results on the distance between $P(\tilde v \ge j)$ and $P(v \ge j/p)$,
where $\tilde v$ is the number of balls with a given color at the end of a trial. For this
sampling mode, it was not possible to prove a result similar to Corollary 1, but
the Berry-Esseen theorem [6] can be used to establish a stronger result for the comparison
between $\tilde v$ and $v$. In [5], it is specifically proved that if we define the function

$h_j(x) = \frac{x^2}{4p^2} \left(\sqrt{1 + 4jp/x^2} - 1\right)^2$ for $x \in \mathbb{R}$ and $j \ge 0$, then

$\left| P(\tilde v \ge j) - P\left(v \ge h_j\left(\sqrt{p(1-p)}\, G\right) \vee j\right) \right| \le c\, E\left(\frac{1}{\sqrt{v}} 1_{\{v \ge j\}}\right),$

where $G$ is a standard Gaussian random variable, for real numbers $a \vee b = \max(a, b)$,
and $c = 3(p^2 + (1-p)^2)/\sqrt{p(1-p)}$. For small $p$, the constant $c \sim 3/\sqrt{p}$. The above
bound is very loose for small $p$ and becomes accurate only for very large values of
$j$. This is why we go further in this paper by establishing a tighter bound for the
ratio $P(\tilde v \ge j)/P(v \ge j/p)$.

Let $(B_n)$ be a sequence of i.i.d. Bernoulli random variables with parameter
$p$ and $v$ an independent random variable on $\mathbb{N}$. Take some $\alpha \in ]1/2, 1[$. Let $\tilde v = \sum_{i=1}^{v} B_i$.

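Theorem 3. For $\alpha \in (1/2, 1)$, we have for all $j \ge 1$

$\frac{P(\tilde v \ge j)}{P(v \ge j/p)} = A(j) + B(j),$

where

$A_1(j) \le A(j) \le A_2(j)$

with

$A_1(j) = \left(1 - \exp\left(-\frac{p}{4} \left(\frac{j}{p}\right)^{2\alpha - 1}\right)\right) \frac{P(v \ge j/p + \lfloor (j/p)^\alpha \rfloor + 1)}{P(v \ge j/p)},$

$A_2(j) = \frac{P(v \ge j/p - \lfloor (j/p)^\alpha \rfloor)}{P(v \ge j/p)},$

and where $B(j)$ is a positive quantity such that

$B(j) \le \exp\left(-\frac{p}{2(1-p)} \left(\frac{j}{p}\right)^{2\alpha - 1}\right) \frac{P(v \ge j)}{P(v \ge j/p)}.$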

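Proof. We have

$P(\tilde v \ge j) = P\left(\sum_{\ell=1}^{v} B_\ell \ge j\right) = T_1 + T_2,$

where

$T_1 = P\left(\sum_{\ell=1}^{v} B_\ell \ge j,\ j \le v \le j/p - \lfloor (j/p)^\alpha \rfloor - 1\right),$

$T_2 = P\left(\sum_{\ell=1}^{v} B_\ell \ge j,\ j/p - \lfloor (j/p)^\alpha \rfloor \le v\right),$

so that $B(j) = T_1/P(v \ge j/p)$ and $A(j) = T_2/P(v \ge j/p)$.

Let us first recall the following inequality for the sum of independent Bernoulli
random variables $B_\ell$, $\ell \ge 1$ [9]: for $x \in [0, 1-p]$

(23)  $P\left(\sum_{\ell=1}^{n} B_\ell - np \ge nx\right) \le e^{-n x^2/\Lambda(x)},$

where

(24)  $\Lambda(x) = 2p(1-p) + \frac{2}{3} x (1 - 2p) - \frac{2}{9} x^2.$

It follows that for $j \le v \le j/p$

$P\left(\sum_{\ell=1}^{v} B_\ell \ge j\right) \le \exp\left(-\frac{(j - pv)^2}{v \Lambda(j/v - p)}\right).$

It is easily checked that the function $v \mapsto v \Lambda(j/v - p)$ is increasing in the interval
$[j, j/p]$ and that for all $v \in [j, j/p]$

$v \Lambda\left(\frac{j}{v} - p\right) \le 2j(1-p).$

Hence, for $v \in [j, j/p]$

$P\left(\sum_{\ell=1}^{v} B_\ell \ge j\right) \le \exp\left(-\frac{(j - pv)^2}{2j(1-p)}\right)$

and for $v \in [j, j/p - \lfloor (j/p)^\alpha \rfloor - 1]$, since $j - pv \ge p (j/p)^\alpha$,

$P\left(\sum_{\ell=1}^{v} B_\ell \ge j\right) \le \exp\left(-\frac{p}{2(1-p)} \left(\frac{j}{p}\right)^{2\alpha - 1}\right).$

This implies that

$T_1 \le P\left(\sum_{\ell=1}^{j/p - \lfloor (j/p)^\alpha \rfloor - 1} B_\ell \ge j\right) P(v \ge j) \le \exp\left(-\frac{p}{2(1-p)} \left(\frac{j}{p}\right)^{2\alpha - 1}\right) P(v \ge j).$

For the term $T_2$, we first note that

$T_2 \le P(v \ge j/p - \lfloor (j/p)^\alpha \rfloor).$

Then, we clearly have

$T_2 \ge P\left(\sum_{\ell=1}^{v} B_\ell \ge j,\ j/p + \lfloor (j/p)^\alpha \rfloor + 1 \le v\right)$

and then

$\frac{T_2}{P(v \ge j/p)} \ge P\left(\sum_{\ell=1}^{j/p + \lfloor (j/p)^\alpha \rfloor + 1} B_\ell \ge j\right) \frac{P(v \ge j/p + \lfloor (j/p)^\alpha \rfloor + 1)}{P(v \ge j/p)}.$

The Chernoff bound implies for $v = j/p + \lfloor (j/p)^\alpha \rfloor + 1$, using $pv - j \ge p (j/p)^\alpha$ and $(j/p)^{\alpha - 1} \le 1$,

$P\left(\sum_{\ell=1}^{v} B_\ell < j\right) \le \exp\left(-\frac{(pv - j)^2}{2pv}\right) \le \exp\left(-\frac{p}{4} \left(\frac{j}{p}\right)^{2\alpha - 1}\right).$

It follows that

$\frac{T_2}{P(v \ge j/p)} \ge \left(1 - \exp\left(-\frac{p}{4} \left(\frac{j}{p}\right)^{2\alpha - 1}\right)\right) \frac{P(v \ge j/p + \lfloor (j/p)^\alpha \rfloor + 1)}{P(v \ge j/p)}$

and the proof follows. □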
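
The concentration step of the proof, namely $P(\sum_{\ell=1}^{v} B_\ell \ge j) \le \exp(-(j - pv)^2/(2j(1-p)))$ for $v \in [j, j/p]$, can be checked against exact binomial tails. The following sketch does so for one pair $(p, j)$; the values $p = 0.2$, $j = 10$ and the function names are assumptions chosen for illustration, and a numerical check on one example is of course not a proof.

```python
import math

def binom_upper_tail(n, p, j):
    """Exact P(B_1 + ... + B_n >= j) for i.i.d. Bernoulli(p) variables."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(j, n + 1))

def concentration_bound(n, p, j):
    """exp(-(j - p n)^2 / (2 j (1 - p))), the bound used for n in [j, j/p]."""
    return math.exp(-((j - p * n) ** 2) / (2 * j * (1 - p)))

p, j = 0.2, 10
# One triple (n, exact tail, bound) for every integer n between j and j/p.
checks = [
    (n, binom_upper_tail(n, p, j), concentration_bound(n, p, j))
    for n in range(j, round(j / p) + 1)
]
all_hold = all(exact <= bound for _, exact, bound in checks)
```

The bound is loose near $n = j/p$, where it equals 1, and becomes sharp only when $n$ is well below $j/p$; this is why the proof restricts $T_1$ to $v \le j/p - \lfloor (j/p)^\alpha \rfloor - 1$.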



The above result can be applied to specific distributions for $v$, namely Pareto and
Weibull distributions, in order to show that the tails of the probability distribution
functions of $\tilde v$ and $pv$ are the same. This is the analog of Proposition 6 for the
probabilistic model.

Corollary 2. If $v$ has either

(1) a Pareto tail distribution with parameter $a > 1$ such that for $x \ge 0$, $P(v \ge x) =
L(x) x^{-a}$ where $L$ is a slowly varying function, i.e., for each $t > 0$,

$\lim_{x \to +\infty} \frac{L(tx)}{L(x)} = 1;$

(2) a Weibull tail distribution with $\beta \in ]0, 1/2[$ such that for $x \ge 0$, $P(v \ge x) =
L(x) e^{-\delta x^\beta}$ for some $\delta > 0$ and $L$ a slowly varying function,

then

$\lim_{j \to +\infty} \left| \frac{P(\tilde v \ge j)}{P(v \ge j/p)} - 1 \right| = 0.$

Proof. For (1),

$\frac{P(v \ge j)}{P(v \ge j/p)} = \frac{L(j)\, j^{-a}}{L(j/p)\, (j/p)^{-a}} = \frac{L(j)}{L(j/p)}\, p^{-a}$

and, for a fixed real number $\varepsilon$,

$\frac{P(v \ge j/p + \varepsilon (j/p)^\alpha)}{P(v \ge j/p)} = \frac{L\left((j/p)(1 + \varepsilon (j/p)^{\alpha - 1})\right)}{L(j/p)} \left(1 + \varepsilon (j/p)^{\alpha - 1}\right)^{-a},$

which tends to 1 when $j$ tends to $+\infty$. This implies that the quantities $A_1(j)$ and
$A_2(j)$ appearing in Theorem 3 tend to 1 and $B(j)$ tends to 0 when $j \to +\infty$.
For (2),

$\frac{P(v \ge j)}{P(v \ge j/p)} = \frac{L(j)}{L(j/p)}\, e^{-\delta j^\beta (1 - p^{-\beta})}$

and it is straightforward that

$\frac{P(v \ge j/p + \varepsilon (j/p)^\alpha)}{P(v \ge j/p)} = \frac{L\left((j/p)(1 + \varepsilon (j/p)^{\alpha - 1})\right)}{L(j/p)}\, e^{-\delta (j/p + \varepsilon (j/p)^\alpha)^\beta + \delta (j/p)^\beta}$
$= \frac{L\left((j/p)(1 + \varepsilon (j/p)^{\alpha - 1})\right)}{L(j/p)}\, e^{-\delta \beta \varepsilon (j/p)^{\alpha + \beta - 1} (1 + o(1))},$

which tends to 1 if $\alpha + \beta < 1$. Let $\beta \in ]0, 1[$. It is sufficient to find $\alpha \in ]1/2, 1[$ such
that $\alpha + \beta < 1$. Necessarily $1 - \beta > \alpha > 1/2$, thus $\beta < 1/2$, and for such a $\beta$, such
an $\alpha$ exists. □



6. Concluding remarks on sampling and parameter inference 

We have established in this paper convergence results for the distribution of
the number of balls with a given color under the assumption that there is a large
number of colors in the urn, that the number of balls with a given color has a
heavy-tailed distribution independent of the color, and that only a small fraction $p$ of the
total number of balls is sampled. We have considered two ball sampling rules. The
first one states that the probability of drawing a ball with a given color depends
upon the relative contribution of the color to the total number of balls and that a
drawn ball is immediately replaced into the urn. With the second rule, each ball is
selected with probability $p$ independently of the others. The two rules do not give
the same results, even if they coincide when $p \to 0$ (see [5] for details).

From a practical point of view, we have shown that it is possible to identify the
original distribution of the number of balls with a given color by using the tail of
the distribution of the number of balls with a given color drawn from the urn.
A stronger result holds for Pareto distributions when the number of colors is very large (see
Proposition 3). This result is robust in practice because it does not rely on the
asymptotics of the tail distribution (in Proposition 3, assertions hold for all $j > a$).

The determination of the original number of balls per color is valid when the
number of balls follows a unique distribution of Pareto or Weibull type. This could
be used in the context of packet sampling in the Internet. In practice, however,
the number of packets in flows is in general not described by a unique "nice"
distribution, but can only be locally approximated by a series of Pareto distributions
(see [2] for a discussion). More sophisticated techniques are then necessary to get
the original statistics of flows.